Compression of Unicode Files
Abstract
The increasing importance of Unicode for text files, for example in Java and in some modern operating systems, implies a possible doubling of data storage space and data transmission time, with a corresponding need for data compression. However, it is not clear that data compressors designed for 8-bit byte data are well matched to 16-bit Unicode data. This paper investigates the compression of Unicode files, using a variety of established data compressors on a mix of genuine and artificial Unicode files. It finds that while Ziv-Lempel and unbounded-context compressors work well, finite-context compressors are less satisfactory on Unicode. Tests with a simple special-purpose compressor intended for 16-bit data suggest that it may be useful to design compressors specifically for Unicode files.
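The storage doubling described in the abstract can be illustrated with a minimal sketch. It is not the paper's experiment: the sample text, the `sizes` helper, and the choice of Python's `zlib` (DEFLATE, a member of the Ziv-Lempel family) are all assumptions made for illustration. It encodes the same text as 8-bit Latin-1 and as 16-bit UTF-16, then compresses both with a byte-oriented compressor.

```python
import zlib


def sizes(text: str) -> dict:
    """Hypothetical helper: raw and DEFLATE-compressed sizes of `text`
    in an 8-bit encoding (Latin-1) and a 16-bit one (UTF-16, big-endian)."""
    latin = text.encode("latin-1")    # one byte per character
    utf16 = text.encode("utf-16-be")  # two bytes per BMP character, no BOM
    return {
        "latin1_raw": len(latin),
        "utf16_raw": len(utf16),
        "latin1_deflate": len(zlib.compress(latin, 9)),
        "utf16_deflate": len(zlib.compress(utf16, 9)),
    }


# Assumed sample: repetitive ASCII text, so both encodings are valid.
sample = "The quick brown fox jumps over the lazy dog. " * 200
result = sizes(sample)
for name, n in result.items():
    print(f"{name:>15}: {n} bytes")
```

Running this on real text in several scripts, rather than on a repeated ASCII sentence, would give a fairer picture of how a byte-oriented compressor copes with 16-bit data.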
Similar resources
A survey of Unicode compression
The Unicode (ISO/IEC 10646) coded character set is the largest of its kind. Almost a million code positions are available in Unicode for formal character encoding, with more than 137,000 additional code positions reserved for private-use characters. This is quite a change from the 128 or 256 characters available in 8-bit “legacy” code pages, or even the thousands available in East Asian double...
Arabic Text Steganography Using Unicode of Non-Joined to Right Side Letters
Steganography is a technique for hiding data in media in a way that makes its existence hard to detect. Text files are a preferable format for use in steganography due to the small storage size of such files. This paper presents an Arabic text steganographic algorithm based on Unicode. The algorithm imposes a minimal change on connected le...
Performance Improvement Of Bengali Text Compression Using Transliteration And Huffman Principle
In this paper, we propose a new compression technique based on transliteration of Bengali text to English. Compared to Bengali, English is a less symbolic language. Thus transliteration of Bengali text to English reduces the number of characters to be coded. Huffman coding is well known for producing optimal compression. When the Huffman principle is applied to transliterated text, significant perfo...
An Image Lossless Compression Patent
Existing general-purpose lossless compression algorithms are not effective on JPEG files. This article proposes a lossless compression method that combines a shuffling algorithm with a lossless compression algorithm, together with a new shuffling algorithm. The new algorithm can compress JPEG files without loss, and the results indicate that this algorithm can ...
Accordion Arrays: Selective Compression of Unicode Arrays in Java
In this work, we present accordion arrays, a straightforward and effective memory compression technique targeting Unicode-based character arrays. In many non-numeric Java programs, character arrays represent a significant fraction (30-40% on average) of the heap memory allocated. In many locales, most, but not all, of those arrays consist entirely of characters whose top bytes are zeros, and, h...
Publication date: 1998